Introduction

The NYC flights dataset contains information about flights coming in and out of NYC. We can inspect the first few elements here.

library(nycflights13); library(rvest); library(tidyverse)
head(flights)

Let’s say we would like to visualize this dataset using leaflet. The problem is that we have a bunch of airport codes but we don’t know what city they’re in It would be nice if we had it mapped out for us (e.g. LAX = Los Angeles) but unfortunately that is not the case.

Good news is that information is easily available on the web! One doesn’t have to look to far to see a table of airport codes and the city they belong to.

Looking at the site http://www.leonardsguide.com/us-airport-codes.shtml we see a nice table already made for us.

But how do I get that data though?

We can employ R’s awesome “rvest” package for web scraping whatever we’d like.

Webscraping is relatively intuitive. There only a few functions we need:

WebScrape the Airport codes

Let’s see what it looks like in action

html.scrape <- read_html("http://www.leonardsguide.com/us-airport-codes.shtml")
Now the tricky part is seeing which nodes contain our data. Here it would be wise use chrome dev tools to see which node contains the info. Here it looks like the

node contains the data.

codes <- html.scrape %>% html_nodes("td") %>% html_text()

Let’s take a moment to recap:

But it is still not in a format we would like. It should be a dataframe with state, city, and airport code as columns instead of a long vector. Now comes the power of the tidyverse.

codes <- as.data.frame(codes)
show(as_tibble(codes))
## # A tibble: 480 x 1
##    codes                           
##    <chr>                           
##  1 Alabama                         
##  2 AL                              
##  3 Birmingham International Airport
##  4 BHM                             
##  5 Dothan Regional Airport         
##  6 DHN                             
##  7 Huntsville International Airport
##  8 HSV                             
##  9 Mobile                          
## 10 MOB                             
## # ... with 470 more rows

Looking at the above, at first glance it would seem difficult to get this into the data frame we want. We also can see one observation that we need to fix…

You can see that some airports have the word “Airport” attached to their name, while others do not. This makes it hard to distinguish which is an airport and what is not. We know that any 3 letter airport code must have a preceding airport in the element above it. Let’s write code to write the word “Airport” to any element that needs it.

Example:

We see that MOB’s airport name is “Mobile” and should be “Mobile Airport”

#loop through dataframe
for(i in 1:nrow(codes)){

  #find 3 letter airport codes
  if(nchar(codes[[i,1]]) == 3){
    
    #check if the string above it contains the word "Airport"
    if(!str_detect(codes[[i-1,1]], "Airport")){
      
      #if it doesn't then write the word "Airport" at the end
      codes[[i-1, 1]] <- paste(codes[[i-1,1]], "Airport", sep = " ")
    }
  }
}

Now we can take full advantage of R’s separate function.

#use R's separate function as we need to separate 3 times

#first regex splits anything thats 2 letters or 3 letters or contains "Airport"
#now you see why we needed to add the word Airport to a few elements
regexp1 <- "(?=(^...$|^..$))|(?=.*(?<=Airport))"

#splits only airports
regexp2 <- "(?=.*(?<=Airport))"

#splits only 3 letters
regexp3 <- "(?=(^...$))"

#now we perform a sweeping data clean
codes <- codes %>%
  separate(col = 1,into = c("State", "StateAbbr"),
           sep = regexp1,extra = "merge") %>% 
  separate(col = 2,into = c("StateAbbr", "Airport"),
           sep = regexp2,extra="merge") %>% 
  separate(col = 2, into = c("StateAbbr","AirportCode"), 
           sep = regexp3, extra = "merge") %>% 
  mutate(
    StateAbbr = lead(StateAbbr), 
    AirportCode = lead(AirportCode, n = 3),
    Airport = lead(Airport, n = 2)
    ) %>% 
  drop_na() %>% 
  na_if("") %>% 
  fill(State,StateAbbr)

Now look at that gorgeous dataframe

show(as_tibble(codes))
## # A tibble: 185 x 4
##    State   StateAbbr AirportCode Airport                                        
##    <chr>   <chr>     <chr>       <chr>                                          
##  1 Alabama AL        BHM         "Birmingham International Airport"             
##  2 Alabama AL        DHN         "Dothan Regional Airport"                      
##  3 Alabama AL        HSV         "Huntsville International Airport"             
##  4 Alabama AL        MOB         "Mobile Airport"                               
##  5 Alabama AL        MGM         "Montgomery Airport"                           
##  6 Alaska  AK        ANC         "Anchorage International Airport"              
##  7 Alaska  AK        FAI         "Fairbanks  International Airport"             
##  8 Alaska  AK        JNU         "Juneau  International Airport  "              
##  9 Arizona AZ        FLG         "Flagstaff Airport"                            
## 10 Arizona AZ        PHX         "Phoenix, Phoenix Sky Harbor International Air~
## # ... with 175 more rows

The next step is to join our custom dataframe to the nycflights dataframe.

We run into an issue however… the nycflights isn’t necessarily tidy. We’d ideally like to join on origin or destination, but which one? The problem here is that we can’t because it is not tidy. We need to adjust those columns such that we have the airport code in one column and the “type” (either origin or departure in the other).

tidy.flights <- flights %>% 
  pivot_longer(cols = c(origin,dest), names_to = "airport_code_type",
               values_to = "AirportCode")

tidy.flights <- tidy.flights %>% 
  left_join(codes, by = "AirportCode") %>% 
  drop_na()

Advanced Web Scraping

Not done yet. To see flight data we need the latitude and longitude of the airports. Let’s scrape this site: www.dices.net/movil/airports/airports-United_States-US-1.html

The issue here is that our data is spread across multiple pages or URLs.

#need to use lapply with paste or paste0
coords.html <- lapply(
paste0(
 "http://www.dices.net/movil/airports/airports-United_States-US-",1:103,".html"
),function(url){
    url %>% #pass in URL
    read_html() %>% #get html
    html_nodes("b") %>% #get node data
    html_text() #make it text
})
#returns only the the 5th-84th elements as the others are not needed
coords.html <- lapply(coords.html, function(x){x[5:84]})

#start extracting
coords.vec <- unlist(coords.html) #unlist into vector
AirportCodes <- str_extract(coords.vec, "^...$") #get 3 letter codes
latlong <- str_extract(coords.vec, "(\\d.*)|(-\\d.*)") #get the digits

#combine latlong into a single element
for(i in seq(3,(length(latlong)-1),4)){
  latlong[i] <- paste0(latlong[i],",",latlong[i+1])
}

#extracts the combined latlong elements 
latlong <- str_extract(latlong,".*,.*")

coords <- data.frame(AirportCode = AirportCodes, latlong = latlong) %>% 
  mutate(AirportCode = lead(AirportCode), latlong = lead(latlong, n = 2)) %>% 
  drop_na() %>% 
  separate(latlong, c("lat","long"), sep = ",") %>% #now we can split on the ","
  mutate(lat = as.numeric(lat),
         long = as.numeric(long)) #convert from string

Now we can finally join in Airport Code

final <- coords %>% left_join(tidy.flights, by = "AirportCode") %>%
  drop_na() 

Finally! Look at that dataframe

show(as_tibble(final))
## # A tibble: 636,432 x 24
##    AirportCode   lat  long  year month   day dep_time sched_dep_time dep_delay
##    <chr>       <dbl> <dbl> <int> <int> <int>    <int>          <int>     <dbl>
##  1 CAK          41.1 -81.5  2013     1     1     1147           1155        -8
##  2 CAK          41.1 -81.5  2013     1     1     2020           2030       -10
##  3 CAK          41.1 -81.5  2013     1     2     1147           1155        -8
##  4 CAK          41.1 -81.5  2013     1     2     2024           2030        -6
##  5 CAK          41.1 -81.5  2013     1     3     1151           1155        -4
##  6 CAK          41.1 -81.5  2013     1     3     2021           2030        -9
##  7 CAK          41.1 -81.5  2013     1     4     1156           1155         1
##  8 CAK          41.1 -81.5  2013     1     4     2029           2030        -1
##  9 CAK          41.1 -81.5  2013     1     5     1144           1145        -1
## 10 CAK          41.1 -81.5  2013     1     5     2023           2030        -7
## # ... with 636,422 more rows, and 15 more variables: arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, airport_code_type <chr>, State <chr>, StateAbbr <chr>,
## #   Airport <chr>

Visualizing the NYC flight destinations

We filter by “dest” and see all the flight destinations from NY. It is important to note that leaflet looks for lat/long columns automatically. Please zoom in and out to see the clusters

library(leaflet)

dests <- final %>% filter(airport_code_type == "dest")

leaflet(dests,height = 350, width = 400) %>%
  setView(lat = 38, lng = -97, zoom = 4) %>% 
  addTiles() %>% 
  addMarkers(clusterOptions = markerClusterOptions())